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^-H Waves are all around us-be it in the form of sound, electromagnetic ra- 

^_^ diation, water waves, or earthquakes. Their study is an important basic 

C/^ tool across engineering and science disciplines. Every wave solver serving 

^ the computational study of waves meets a trade-off of two figures of merit- 

• its computational speed and its accuracy. Discontinuous Galerkin (DG) 

O methods fall on the high-accuracy end of this spectrum. Fortuitously, their 

computational structure is so ideally suited to GPUs that they also achieve 

r-H very high computational speeds. In other words, the use of DG methods 

, 1^ on GPUs significantly lowers the cost of obtaining accurate solutions. This 

■^ article aims to give the reader an easy on-ramp to the use of this technology, 

^O based on a sample implementation which demonstrates a highly accurate, 

GPU-capable, real-time visualizing finite element solver in about 1500 lines 



in 

^ of code. 



1 Introduction, Problem Statement, and Context 



H At the beginning of our journey into high-performance, highly accurate time- 

domain wave solvers, let us briefly illustrate by a few examples how common 
the task of simulating wave phenomena is across many disciplines of science 
and engineering, and how accuracy figures into each of these application 
areas. Consider the following examples: 
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• An engineer needs to understand the time-domain response of an oscil- 
lating structure such as an accelerator cavity. Real-life measurement 
of the desired properties is extremely costly, if it is possible at all. 
Accuracy is important because wrong results may lead to wrong con- 
clusions. 

• A seismic engineer has time-domain data from a sounding using geo- 
phones and needs to model an underground structure, characterized 
by different wave propagation speeds. Doing so requires a solver for 
the 'forward problem', i.e. a code that, given data about the location 
of the sources and wave propagation speeds throughout the under- 
ground domain, can model the propagation of the waves. Accuracy 
is important because these simulations often inform potentially very 
costly enterprises such as drilling or mining. 

• An electrical engineer wants to model the stealth properties of a new 
airplane involving complicated nonlinear materials. Physical proto- 
typing is expensive, and accurate predictions of scattering properties 
help minimize its necessity. 

In the field of time-domain wave simulation, the main competitors of the 
discontinuous Galerkin method include finite-difference, finite- volume and 
continuous finite-element methods. In a nutshell, finite-difference solvers 
have trouble representing complicated geometric boundaries, finite-volume 
methods become very difficult (and very expensive) to implement at a high 
order of accuracjjj and continuous finite-element methods typically assemble 
large, sparse matrices, whose application to a vector is necessarily memory- 
bound and thus unable to make use of the massive compute bandwidth 
available on a GPU. 

In addition, while finite-difference methods have relatively benign imple- 
mentation properties on CPUs Cohen, Micikevicius [2009 , we will see that 



the computational structure of DG methods is even better suited to GPU 
implementation at high accuracy because they largely avoid the wide "halo" 
of outside values that must be fetched in order to apply a large (high-order) 
stencil to three-dimensional volume data. 



This chapter complements an article [Klockner et al. 2009a| which we have 



recently published that, in its spirit, is probably more like the other chapters 



^The order of accuracy refers to the power with which the error decreases as the 
discretization is refined-for example, if the distance between neighboring mesh points is 
halved, a fourth-order scheme would decrease the error by a factor of sixteen. 



in this volume in that it exposes all the technicalities and tricks that have 
enabled us to demonstrate high-speed DG on the GPU. To avoid redundancy 



between [Klockner et al. , 2009a] and this chapter, we have instead chosen to 



focus our treatment here on easing a prospective user's entry into using our 



technology. While Klockner et al. , 2009a| is very technical and not entirely 



suited as an introduction to the subject, in this chapter we will be applying 
a number of simplifications to facilitate understanding and promote ease-of- 
use. 



2 Core Method 

Discontinuous Galerkin (DG) methods for the numerical solution of partial 
differential equations have enjoyed considerable success because they are 
both flexible and robust: They allow arbitrary unstructured geometries and 
easy control of accuracy without compromising simulation stability. Lately, 
another property of DG has been growing in importance: The majority of 
a DG operator is applied in an element-local way, with weak penalty-based 
element-to-element coupling. 

The resulting locality in memory access is one of the factors that enables 
DG to run on off-the-shelf, massively parallel graphics processors (GPUs). 
In addition, DG's high-order nature lets it require fewer data points per 
represented wavelength and hence fewer memory accesses, in exchange for 
higher arithmetic intensity. Both of these factors work significantly in favor 
of a GPU implementation of DG. 

Readers wishing a deeper introduction to the numerical method are referred 



to the introductory textbook Hesthaven and Warburton 2007 . 



3 Algorithms, Implementations, and Evaluations 

3.1 Background Material 

3.1.1 A Precise Mathematical Problem Statement 

Discontinuous Galerkin methods are most often used to solve hyperbolic 
systems of conservation laws in the time domain. This rather general class 



of partial differential equation (PDE) can be written in the form 

^+V..F(g) = /. (1) 

DG methods generally solve the initial boundary value problems (IBVPs) of 
these equations on a bounded domain Q. This means that in addition to 
the PDE M, one needs to specify the finite geometry of interest, an initial 
value of the solution q at an initial time Tq (which we will assume to be 
zero) as well as which (potentially time-dependent) conditions prevail at the 
boundary dO, of the domain. In addition, source terms may be present. 
These are represented in (IT| by /. 

Classes of partial differential equations more general than (IT]) , such as parabolic 
and elliptic equations, can be solved using DG methods. In this chapter, we 
will focus on hyperbolic equations, and for the sake of exposition, on one 
particularly important example of these equations, the second-order wave 
equation in two dimensions. To emphasize the equation's grounding in re- 
ality, we will cast this equation as (the transverse-magnetic version of) the 
linear, isotropic, constant-coefficient Maxwell's equations in two dimensions 
and show the method's development by its example. The equation itself is 
given by 

dE^ dHy dH^ 
dt dx dy 

One easily verifies that this equation can be rewritten into the more well- 
known second order form of the wave equation, 

with c~^ = e/i. For simplicity, we may assume c = e = /i = 1. Together 
with an initial condition as well as perfectly electrically conducting (PEC) 
boundary condition 

Ez{x,t) = Q ondn. 

Observe that no value is prescribed for the magnetic fields Hx, Hy, which 
we leave to obey natural boundary conditions. In terms of the second-order 
wave equation, PEC corresponds to a Dirichlet boundary. 



3.1.2 Construction of the Method 

To begin the discretization of ([2]), we assume that the domain J7 is polyhe- 
dral, so that it may be represented as a union ri = l+J^^^ D^ C M^ consisting 
of disjoint, straight-sided, face-conforming triangles D^. 

We demonstrate the construction of the method by the example of equation 



(2c). We begin by multiplying (2c) with a test function <f) and integrating 
over the element D^: 



dt 



'dV+ f V(^,^yy{-Hy,H^fct)dV. 

JDi. '• V ■' 



F= 



Observe that the vector- valued F indicated here assumes the role of the flux 
F in ([l]). Integration by parts yields 

0=/ ^(pdV-f{-Hy,H,f-V^,^y)ct)dV+f n-{-Hy,H^f(PdS, 

(3) 

where n is the unit normal to Sfi. Now a key feature of the method enters. 
Because no continuity is enforced on H^ and Hj^ between D^ and its neigh- 
bors, the value of H^ and Hy on the boundary is not uniquely determined. 
For now, we will record this fact by a superscript asterisk, denote these 
chosen values the numerical flux, and leave a determination of what value 
should be used for later. 

To revert the so-called weak form (jsl) to a shape more closely resembling the 
original equation (2c), we integrate by parts again, obtaining the so-called 
strong form. 



BE, 
D. dt 



>dV+ f V^,^yy{-Hy,H,f(l)dV 

JDk 

-f n-{-iHy-H;),H,-H;f<pdS (4) 

JdD,. 



where we carefully observe that the boundary term obtained in the last step 
has stayed in place. 

To determine the values of H* , we note that in many cases a simple average 
across neighboring faces, i.e. H* := {H~^ + H~)/2 leads to a stable and 



accurate numerical method, where H denotes the values on the local face. 
This is termed a central flux. We choose a more dissipative (but less noisy) 



upwind flux Mohammadian et al. 1991|, given by 

+ ny[Hy] 



n 



{F -F*) 



\H^. 



-n^[E.2\ +ahy' 

ny[H^] -fixlHy 



[hx[Hx] 



+ hy[Hy] 



- [Hy]) 



(5) 



The value to be used for h ■ {-{Hy - H*),Hx - H*)'^ in Q can be read 
from the third entry of the right hand side of ([5|, and the first two entries 
apply to equations (2a) and (2b). We have used the common notation 
[q] = Q~ ~ Q~^ for the inter-element jumps, a is a parameter, commonly 
chosen as 1. Obviously, a = recovers a central flux. 

We expand E, H, and 4> into a basis of Np Lagrange interpolation polyno- 
mials li spanning the space P^ of polynomials of total degree A^, where the 
Lagrange interpolation points are purposefully chosen for numerical stabil- 
ity Warburton 2006]. Substituting the expansions into (Q combined with 
([5| yields a numerical scheme that is discrete in space, but not yet in time. 



3.1.3 Implementation Aspects 

To actually implement this scheme, we express ([4]) in matrix form. To do so, 
first note that in our setting, each element D^ C fi can be obtained by an 
affine map ^(r, s) = Ak{r, s)'^ + bk from a reference element I. Now define 
the mass matrix 



M 



V 



klj dV 



\A,\M 



\Ah 



klj dV. 



'Dfe J\ 

\Ak\ is the determinant of the matrix A^. Also let V be the matrix that 
realizes polynomial differentiation along the reference element's z^th axis in 
Lagrange coefficients. Polynomial differentiation along global coordinates 
is realized as a linear combination of these local differentiation matrices, 
according to, e.g.. 



92 



This allows us to express an implementation of the volume part of ([4]): 
= \Ak\M^^^ + \Ak\M{V'^'^'-{-Hy)N + V'''9y{H,)N) 



(6) 



dt 



f h-{-{Hy-H;),H,-H:fcPdS. (7) 
JdDk 



For numerical stability at increasing A^, the matrices A4 and 2? are com- 



puted by ways of orthogonal polynomials on the triangle Dubiner, 1991 



Koornwinder, 1975 



To implement the surface terms, define the surface mass matrix for a single 
face r of the reference triangle I : 



mE 



rc9i 



lilj dS. 



Suppose we compute values oi h • {F — F*) 



n 



[Hy - H*),Hx 



H. 



*\T 



along all faces and concatenate these into one vector. Then the sum over all 
facial integrals may be computed through a carefully assembled matrix: 



/ h-{F-F*)(^dS 

JdD,, 



M' 



Iv T^ 



M' 



Jih ■ {F - F*)\r, 



J,n.{F-F*)\r,y (8) 



We denote this matrix A4 and the vector to which we are applying it f . 
The factors Jn are the determinants of the afiine maps parametrizing the 
faces of Dfc with respect to the faces of I. 

Returning to ([T]), we left-multiply by jA^I^^A^^^ to obtain 



dt 



+ {V' 



k,dx f 



-Hy)^+V'''^^{H,) 



N) 



A.r'M-'M'^i' 



d\ek 



(9) 



Despite all the machinery involved, Q is strikingly simple, consisting of 
three data-local element-wise matrix-vector multiplications (two differenti- 
ations, one combined face mass matrix) and a surface flux exchange term. 
A view of the flow of data is provided by Figure [T| 

Even better, the time derivative ^ g^-^^ occurs on its own, making it possible 
to use simple, explicit Runge-Kutta methods for integration in time. 



3.2 A Minimal Implementation 

After this very quick (but mostly self-contained) introduction to discon- 
tinuous Galerkin methods, we will now discuss how Q and its analogous 
extension to (2a) and (2b) may be brought onto the GPU to form a solver 



for the 2D TM variant of Maxwell's equations. 




Surface Integration 



Local Differentiation 




Figure 1: Decomposition of a DG operator into subtasks. Element-local operations are 
highlighted with a bold outline. 



3.2.1 Introduction 



To make the discussion both more tangible and easier to follow, we have cre- 
ated a simple implementation of the ideas presented here. This implementa- 
tion may be downloaded from the URL http : //tiker . net/gcg-dg-code-download 
As improvements are made, the code at this address may change from 
time to time. The source code may also be browsed on-line at http: 



//t iker.net/gcg-dg- code-browse 



We will begin by briefly discussing the construction of this package. The 
solver is written in Python. We feel that this allows for clearer code that, in 



both notation and structure, closely resembles the MATLAB codes of Hes 



thaven and Warburton [2007 , and yet allows a simple and concise GPU im- 
plementation to be added using, in this case, PyOpenCL (PyCUDA, whose 
use is demonstrated in another chapter of this volume, would have been an- 
other obviously possible implementation choice). In addition, the solver is 
designed for clarity, not peak performance. What we mean here is that the 
compute kernels we show are rather simple and lack a few performance op- 
timizations. The solver's performance is not related to its implementation 
language. High-performance GPU codes can easily be constructed using 
PyOpenCL (and PyCUDA), which is demonstrated below and in a number 
of other chapters of this volume. Finally, we would like to remark that the 
kernels as shown below are optimized for Nvidia CPUs and run well on chips 
ranging from the G80 to the GFIOO. 

In discussing the solver, we focus on the performance-relevant kernels run- 
ning on the GPU. There are other, signiflcant parts of the solver that deal 
with preparation and administrative issues such as mesh connectivity and 
polynomial approximation. These parts are obviously also important to the 
success of the method, but they are beyond the scope of this chapter. The 
interested reader may find them explained more fully in the introductory 



book Hesthaven and Warburton 2007 . Once we have discussed the func 



tioning of the basic solver, we will describe any additional steps that may 
be taken to improve performance. Lastly, we will discuss a set of features 
that may be added to this rather bare implementation to make it more use- 
ful. In the next chapter, we close by showing performance numbers first for 
this solver, and then for our production solver, which is a more complete 
implementation of the ideas to follow. 

3.2.2 Computing the Volume Contribution 

First in our examination of implementation features is what we call the 
"volume kernel", which achieves element-local differentiation as described 
in (p]) and ([7]). The key parts of the kernel's OpenCL C source are given in 
Listing [TJ 

One key objective of this subroutine is the multiplication of the local differen- 
tiation matrices P by a large number of right hand sides, each representing 
degrees of freedom ("DOFs") on an element. This is a suitable point to real- 
ize that matrix- vector multiplication by a large number of vectors is equiva- 
lent to matrix- matrix multiplication by a very fat, moderately short matrix 
that encompasses all elemental vectors glued together. Perhaps the most 
immediate approach to such a problem might be to use Nvidia's CUBLAS. 
Unfortunately, while CUBLAS successfully covers a great many use cases, 
the matrix sizes in question here resulted in uninspiring performance in our 



experiments [Klockner et al. , 2009aj. We are thus left considering the choices 



for a from-scratch implementation. 

In the design of computational kernels for CPUs, perhaps the key defining 
factor is the work decomposition into thread blocks (in CUDA terminology) 
or work groups (in OpenCL terminology). In our demonstration solver, we 
choose a very simple alternative, a one-to-one mapping between elements 
and work groups, and a one-to-one mapping between output degrees of free- 
dom and threads (CUDA) or work items (OpenCL). This choice is simple 
and expedient, but it can be improved upon in a number of cases, as we will 
discuss in Section [3.3.31 

The next key decision is the memory layout of the data to be worked on. 
Again, we make a simple choice and describe possible improvements later. 



As was discussed in Section 3.1.2 the data we are working on consists of Np 



coefficients of Lagrange interpolation polynomials for each of the K elements. 



const int n = get_localJd (0); 
const int k = get_group_id (0); 

int m = n+k*BSIZE; 
int id = n; 

LHx[id] = g_Hx[m]; 
LHy[id] =g_Hy[m]; 
LEz[id] =g_Ez[m]; 

barrier (CLK_LOCAL_IVIEM_FENCE); 

float dHxdr=0,dHxds=0; 
float dHydr=0,dHyds=0; 
float dEzdr=0,dEzds=0; 

float Q; 

for(m=0; m<p_Np; ++m) 

{ 

float4 D = read_imagef(i_DrDs, samp, (int2)(n, m)); 

Q = l_Hx[m]; dHxdr += D.x*Q; dHxds += D.y*Q; 
Q = i_Hy[m]; dHydr += D.x*Q; dHyds += D.y*Q; 
Q = l_Ez[m]; dEzdr += D.x*Q; dEzds += D.y*Q; 



} 



const float drdx = g_vgeo[0+4*k] 

const float drdy = g_vgeo[l+4*k] 

const float dsdx = g_vgeo[2+4*k] 

const float dsdy = g_vgeo[3+4*k] 



m = n + BSIZE*k; 

g_rhsHx[m] = — (drdy*dEzdr+dsdy*dEzds); 
g_rhsHy[m] = (drdx*dEzdr+dsdx*dEzds); 
g_rlisEz[m] = (drdx*dHydr+dsdx*dHyds 
— drdy*dHxdr— dsdy*dHxds); 

Listing 1: OpenCL kernel implementing element-local volume contribution in the discon- 
tinuous Galerkin method, consisting of element-local polynomial differentiation. 
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Observe that, by their being coefficients of interpolation polynomials, they 
each represent the exact value of the represented solution at a point in space 
belonging to a certain element. 

To ensure that each work group can fetch element data in the least number 
of memory transactions, we choose to pad each element up to \Np] ig floating 
point values, where the notation \x~\y represents x rounded up to the nearest 
multiple of y. 

With data layout and work decomposition clarified, we can now examine the 
implementation itself, as shown in Listing [TJ After getting the element num- 
ber k and the number of the elemental degree of freedom n from group and 
local IDs, respectively, first the elemental degrees of freedom for all three 
fields {Hx, Hy, Ez) are fetched into local memory for subsequent multipli- 
cation by the differentiation matrix. 

As indicated above, the work being performed is effectively matrix-matrix 
multiplication, and therefore existing best practices suggest that also fetch- 
ing the matrix into local memory might be a good idea. At least on pre- 
Fermi chips, this does not turn out to be true. We will take a closer look 



at the trade-offs involved in Section 3.3.2 For now, we simply state that 



the matrix is streamed into core through texture memory, and its fetch cost 
amortized by reusing it for not just one, but all three fields {Hx^ Hy, and 

Once the derivatives along each element's axes are computed by matrix 
multiplication, they are converted to global x and y derivatives according 
to ([6|, using separate per-element geometric factors. Finally, the results are 
stored, where our memory layout and work decomposition permit a fully 
coalesced write. 



3.2.3 Computing the Surface Contribution 

The second (and slightly more complicated) part of our sample implementa- 
tion of the DG method is what we call the "surface kernel" , which, as part 
of the same subroutine, achieves both the extraction of the flux expression 
of ([5]) and the surface integration of ([8|. The key parts of the OpenCL C 
source code of the kernel is shown in Listing [2] and continued in Listing [3] 

We use the same one-work-group-per-element work partition as in the previ- 
ous section, and obviously the memory layout of the element data is likewise 
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const int n = get_locaLid (0); 
const int k = get_group_id (0); 



_Jocal float LfluxHx [p-Nafp] 
_Jocal float LfluxHy [p_Nafp] 
_Jocal float LfluxEz [p_Nafp] 



int m; 

/* grab surface nodes and store flux in shared memory ■■ 
if (n < p_Nafp) 

{ 
/* coalesced reads (maybe) */ 
m = 6*(k*p_Nafp)+n; 



const 


int 


idM = 


g_surfinfo 


m] 


m += 


p_Nafp 


int 




idP = 


g_surfinfo 


m] 


m += 


p_Nafp 


const 


float 


Fsc = 


g_surfinfo 


m] 


m += 


p_Nafp 


const 


float 


Bsc = 


g_surfinfo 


m] 


m += 


p_Nafp 


const 


float 


nx = 


g_surfinfo 


m] 


m += 


p_Nafp 


const 


float 


ny = 


g_surfinfo 


m] 







float dHx=0, dHy=0, dEz=0; 
dHx = 0.5f*Fsc*( g-Hx[idP] - g_Hx[idM]) 
dHy = 0.5f*Fsc*( g-Hy[idP] - g_Hy[idM]) 
dEz = 0.5f*Fsc*(Bsc*g_Ez[idP] - g_Ez[idM]): 



const float ndotdH = nx*dHx + ny*dHy; 



LfluxHx [n] = — ny*dEz + dHx — ndotdH*nx; 
LfluxHy [n] = nx*dEz + dHy — ndotdH*ny; 
LfluxEz [n] = nx*dHy — ny+dHx + dEz; 

} 

/* make sure all element data points are cached */ 
barrier (CLK_LOCAL_MEM_FENCE); 

(Continued in Listing^) 

Listing 2: Part 1 of tlie OpenCL kernel implementing inter-element surface contribution 
in the discontinuous Galerkin method, consisting of the calculation of the surface flux of 



HI- 



12 



(Continued from Figure pi) 

if (n < p_Np) 

{ 

float rhsHx = 0, rhsHy = 0, rhsEz = 0; 
int col = 0; 

/* can manually unroll to 3 because there are 3 faces */ 
for (m=0;m < p_Nfaces*p_Nfp;) 

{ 

float4 L = read_imagef(LLIFT, samp, (int2)(col, n)); 

++col; 

rhsHx += L.x*LfluxHx[m]; 
rhsHy += L.x*LfluxHy[m]; 
rhsEz += L.x*LfluxEz[m]; 

++m; 

rhsHx += L.y*LfluxHx[m]; 
rhsHy += L.y*LfluxHy[m]; 
rhsEz += L.y*l_fluxEz[m]; 
++m; 

rhsHx += L.z*LfluxHx[m]; 
rhsHy += L.z*LfluxHy[m]; 
rhsEz += L.z*l_fluxEz[m]; 
++m; 
} 

m = n+k*BSIZE; 

g_rhsHx[m] += rhsHx; 
g_rhsHy[m] += rhsHy; 
g_rhsEz[m] += rhsEz; 
} 

Listing 3: Part 2 of the OpenCL kernel implementing inter-element surface contribution in 
the discontinuous Galerkin method, consisting of the lifting of the surface flux contribution, 
as described in Q. 
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unchanged. It is however important to note that the kernel in question here 
operates on two different data formats during its hfetime. First, the output 
of the flux gather results in a vector of facial degrees of freedom as displayed 
in (Is]). The number of entries in this vector is 'iNfp, where three is the 
number of faces in a triangle, and Nfp = A^ + 1 is the number of degrees of 
freedom required to discretize each face. In general SNfp ^ Np, where we 
recall that Np is the number of volume degrees of freedom. At each of the 
two stages of the algorithm, we employ a design that uses one work item 
per degree of freedom. The number of work items required per work group 
is therefore max(A'p, A'^^p), and at the start of each stage of the algorithm, 
we need to verify whether the local thread number is less than the number 
of outputs required in that stage, to avoid computing (and perhaps storing) 
spurious extra outputs. This is necessary in both stages because either of 
Np or 3Nfp may be larger. 

After fixing the DOF and element indices n and k, the kernel begins by 
allocating 3Nfp degrees of freedom of local storage {Nfp per face) for each of 
the three fields {Hx, Hy, and Ez). This local memory serves as a temporary 
storage for the facial vector of ([8|. Next, index and geometry information is 
read from a surface descriptor data structure called surfinfo. For each facial 
degree of freedom, and hence for each work item, this data structure contains 
the index of the volume degree of freedom the work item processes, as well 
as the index of its facial neighbor (idM and idP, respectively). In addition, 
surfinfo contains geometry information, namely the surface unit normal of 
the face being integrated over (nx and ny), the surface Jacobian divided by 
the element's volume Jacobian (Fsc) and a boundary indicator (Bsc) used for 
the implementation of boundary conditions which takes values of ±1. Based 
on this information, the kernel computes jump terms [Hx], [Hy], [Ez] which 
are then scaled with the geometry scaling Fsc and stored as dHx, dHy, and 
dEz. The computation of the flux expression ([5]) and its temporary storage 
in local memory concludes the first part of the kernel, displayed in Listing 

El 

The second part of the kernel, of Listing [3j is much like local polynomial 



differentiation as discussed in Section 3.2.2 in that it represents an element 



local matrix multiplication. We have applied the same design decisions as 
above for simplicity, mainly based on the facial flux data already being 
resident in local memory. Again, data for the matrix is streamed in using 
the texture units, and the streamed matrix is reused for each of the three 
flelds. One trick we were able to apply here is the three-fold unrolling 
of the loop. This is valid because we know that the combined face mass 
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matrix A4 covers three faces and hence must have a column count that is 
divisible by three. Naturally, the same applies to the lifting matrix C := 
A^~^A^ . The result of this matrix- vector product is then added to global 
destination arrays in which the volume contribution to the right-hand side 
d{Hx, Hy, EzY' /dt is already stored, completing the computation the entire 
right-hand side of ([9|. 

This concludes our description of the basic kernel implementing the compu- 
tation of the ODE right-hand side for nodal discontinuous Galerkin meth- 
ods. What is missing to complete the implementation of the method is a 
simple Runge-Kutta time integrator, which we have implemented using Py- 
OpenCL's built-in array operations. We now proceed to discuss a number 
of ways in which performance of these basic kernels can be improved. 



3.3 Improving Performance 

3.3.1 Avoiding Padding Waste: Data Aggregation 

In the above codes, each element is represented by \Np\ ig floating point val- 
ues for alignment and fetch efficiency reasons. Especially in two dimensions, 
or in three dimensions for elements of relatively small polynomial order A'^, 
this extra padding can be rather inefficient~not just in terms of GPU mem- 
ory use, but especially also in computational resources. All of our kernels 
adopt a one-work-item-per-output design, and hence wasted memory has a 
one-to-one correspondence to wasted computational power. This is all the 
more true once one realizes that Nvidia hardware schedules computations 
in units of 32-wide warps, such that a rounding to 16 has a chance of 50 
per cent of leaving the trailing half-warp of the computation unused. An 
obvious remedy for this problem is the aggregation of multiple elements into 
a single unit. This aggregation represents a trade-off against the work par- 
tition flexibility of all kernels operating on the data, and should therefore 
be chosen as small as possible, while still minimizing waste. 



In [Klockner et al. 2009a , we pursue a compromise strategy, where we 



choose a granularity that combines enough elements so that less than a 
given percentage (e.g. 10) of waste occurs. All further occurring granulari- 
ties are then required to operate integer multiples of this smallest possible 
granularity. To differentiate this granularity from the generally larger work 
group size (or "block size" in CUDA terminology), we have introduced the 
term microblock to denote it. 
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3.3.2 Which Memory for What? 

On-chip memory in a GPU setting is always somewhat scarce, and we foresee 
that this will remain so for the foreseeable future. As already discussed in 



Section 3.2 it is far from clear which portions of the GPU's on-chip memory 
should be used for what data. In discussing this question, we focus on the 
element-local matrix- vector polynomial differentiation as this asymptotically 
(and practically) dominates runtime as the polynomial degree N increases. 



In [Klockner et al. 2009a , we discuss two possible strategies for element- 



local matrix multiplication, the first of which proceeds by loading the matrix 
into local memory, and the second of which loads field data into local mem- 
ory, as we have done above. 

To allow the flexibility of being able to choose which strategy to use for each 
each of the two element-local matrix products (differentiation and lift), the 



surface kernel of Section 3.2.3 may have to be split into its two constituent 
parts, necessitating an extra store-load cycle. We find that this disadvan- 
tage is entirely compensated by the advantage of being able to use a more 
immediately suitable work partition for each part. 

The enumeration of the two strategies begs an immediate question-why is 
the strategy of loading both quantities into local memory not considered? 
The reason for this lies rooted in a number of important practicalities. While 
generic matrix multiplication routines are free to optimize for the case of 
large, square matrices, the matrices we are faced with are small. A generic 
blocking strategy would therefore leave us with many inefficient corner cases 
which would come to dominate our run time. In addition, in the case of 
three-dimensional geometry, the three differentiation matrices of size Np x Np 
exhaust the local memory on Nvidia hardware even for moderate N. 

We find that, at low-to-moderate N and in general in two dimensions, we 
can derive a gain of about 20 per cent by making the matrix-in-local strategy 
available in addition to the field-in-local strategy shown above. Nonetheless, 
the latter strategy has fewer size restrictions than the former, is thus more 
generally applicable, and it successfully uses register and texture memory 
to avoid many redundant matrix fetches. This justifies our choice of the 
strategy in our demonstration code. In these codes, we amortized matrix 
fetch costs by operating on three fields at once. Note that even in a scalar 
(i.e. single-field) case, such amortization is possible, simply by operating on 
multiple elements within the same work item. 
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> Thread 
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Figure 2: Possibihties for partitioning a large number independent work units, potentially 
requiring some preparation, into work groups. Each work unit consists of multiple stages, 
symbolized by different colors. Preparation is shown in a cyan color. An example of this 
would be multiple independent matrix- vector multiplications, each consisting of individual 
multiply-add cycles. 



3.3.3 Rethinking the Work Partition 

Because of the inherent advantages of using one work item per output value, 
the question of work partitioning into work groups and work items on GPU 
hardware is never far removed from that of memory use and data layout, as 
discussed above. The work partition chosen in the demonstration code was 
purposefully simple-one work item per degree of freedom, one work group 
per element. Moving beyond that, while taking into account the lessons of 



Section 3.3.1 one naturally arrives at a partition of one microblock per work 
group. But even this can be further generalized, as one may work on more 
than one microblock in each work group. We assume here that the work on 
each work item is independent, but may depend on some preparation, such 
as fetching matrix data into on-chip memory. This opens up a number of 
possible avenues: One may. . . 

• process microblocks in parallel, adding more work items to each work 
group, to achieve better usage of individual compute units. 

• process microblocks in sequence, leaving the number of work items 
unchanged, but doing more work in each work item, thus amortizing 
preparation work. 

• process multiple work items along with each other, reusing auxiliary 
data (such as matrices) that is already present in machine registers. 
We term this usage ^Hn-line paralleF . 
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• use any combination of the above. 

Figure [2] illustrates these possibilities. 

If a strategy is chosen that exploits the parallel processing of multiple mi- 
croblocks in one work group (the first option) above, subtle questions of 
thread ordering arise that may influence the number of local memory bank 



conflicts. In [Klockner et al. , 2009a , we discuss one such question in more 
detail. 

Note that all combinations of parallel, sequential, and in-line parallel do 
the same amount of work, and should, in theory, require similar time to 
complete. In practice, this is not the case. This begs the question of how to 
decide between the numerous different possibilities. It is of course possible to 
explore manually which combination yields the best performance, but this 
is tedious, error-prone, and needs to be repeated for nearly every change 
to the hardware on which the code is run. This clearly undesirable, but a 
potential solution is described in the next section. 



3.3.4 Using Run-Time Code Generation 



In [Klockner et al. 2009b , we discuss the numerous benefits of being able to 
generate computational code immediately before it is used, i.e. being able to 
perform C-level run-time code generation. We are delighted that OpenCL 
has this capability built into its specification. We do note however that 
the feature can be retrofitted onto CUDA through the use of the PyCUDA 
Python package. In our DG demonstration code, we already make simple 
use of this facility, by using string substitution on the source code of our 
kernels to make certain problem size parameters known to the compiler 
at compilation time. This helps decrease register pressure and allows the 
compiler to use a number of optimizations such as static loop unrolling for 
loops whose trip counts are now known. 

But this is far from the only benefit. Another immediate advantage is the 
ability to perform automated tuning to answer questions such as the one 
raised at the end of the last section, where individual kernels can be gener- 
ated to cover any number of code variants to be tried. Once this has been 
accomplished, implementing automated tuning can be as simple as looping 
over all variants and comparing timing data for each. For larger search 
spaces, a more sophisticated strategy might be desirable. This entire topic 



is discussed in much greater detail in |Klockner et al. 2009b 
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3.3.5 Further Tuning Opportunities 

In the demonstration code, some inefficiency hes buried in the way the sur- 
face fluxes are evaluated. Because the data required by the surface evalu- 
ation grows as 0{N ), whereas the volume data's size grows as 0{N ), 
this is mainly felt at low-to-moderate N, which are particularly relevant for 
practical purposes. 

First, the index data loaded into idM and idP has significant redundancy and 
can easily be compressed by breaking it down into element offsets added to 
one particular entry from a list of subindex lists. This list of subindex lists 
is comparatively small and has better odds of being able to reside in on-chip 
memory. 

Second, data for faces lying opposite to each other is fetched twice-once 
for each side-in our current implementation. Through a blocking strategy 
(which, unfortunately, introduces significant complexity) these redundant 
fetches can be avoided. The strategy is discussed in detail in [Klockner 



et al. 2009a 



The last opportunity for tuning we will discuss in this setting is the use of 
multiple GPUs through MPI or Pthreads. Since only facial data for fiux 
computation needs to be exchanged between GPUs, such a code is cheap in 
communications bandwidth and relatively easy to implement. We will now 
turn our discussion here from opportunities for speed increase to ways of 
making the technology more useful and more broadly applicable. 

3.4 Adding Generality 

GPU-DG as demonstrated in the demonstration code in this chapter can 
be extended in a number of ways to address a larger number of application 
problems. 

Three Dimensions Perhaps the most gentle, but also the most imme- 
diately necessary generalization is the use of three-dimensional dis- 
cretizations. The main complexity here lies in adapting the set-up 
code that generates matrices and computes mesh connectivity. The 
GPU kernels only require mild modification, although a number of 
complexity trade-offs change, requiring different tuning decisions. It 
is further helpful to generate general, n-dimensional code from a sin- 
gle source through run-time code generation, to reduce the amount of 
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code that needs to be maintained and debugged. 

Double Precision We have found that for most engineering problems, sin- 
gle precision calculations are more than sufficient. We do however ac- 
knowledge that some problems do require double precision, and our 
methods can be easily adapted to accommodate it. 

General Boundary Conditions Our demonstration code only provided 
facilities for a single (Dirichlet) boundary condition. In nearly all 
practical problems, multiple boundary conditions (BCs) are needed, 
ranging from Neumann to absorbing BCs to even more complicated 
conditions arising in fluid dynamics. 

General Linear Systems of Conservation Laws In addition to more 
general BCs, one obviously often wants to solve more general PDEs 
than the 2D wave/Maxwell equation discussed here-this might in- 
clude Maxwell's equations in 3D or the equations of aeroacoustics. 
Again, making these adaptations to the demonstration code is rel- 
atively straightforward and mainly entails implementing a different 
local differentiation operator along with a new flux expression. 

Nonlinear Systems of Conservation Laws Once general linear systems 
are treated by GPU-DG, it is, conceptually, not a very big step to also 
treat non-linear systems of conservation laws, such as Euler's equa- 
tions of gas dynamics or the even more complicated Navier-Stokes 
equations, potentially along with various turbulence models. The good 
news is that the methods presented so far again generalize seamlessly 
and work well for simple problems. 

However, due to the subtle subject matter, a number of refinements 
of the method may be required to successfully treat real application 
problems. 

The first issue revolves around the evaluation of non-linear terms on a 
nodal grid and the aliasing error thus introduced into the method. One 
possibility of addressing this is filtering, which is easily implemented, 
but impacts accuracy. Another is over-integration using quadrature 
and cubature rules to more accurately approximate the integrals in- 
volved in the method. A more detailed discussion of these subjects is 



beyond the scope of this chapter and can be found in Hesthaven and 



Warburton, 2007 



Shocks, i.e. the spontaneous emergence of very steep gradients in 
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the solution, are another comphcation that only arises in nonlinear 
problems. Some initial ideas on using GPU-DG in conjunction with 



shock-laden flow computations are available in Klockner et al. , 20111. 



Curved Geometries While finite element methods already ofl^er much greater 
flexibility in the approximation of geometry than solvers on structured 
grids, the demonstration solver shown here is restricted to geometries 
that consist of straight surfaces. A cost-efficient way to extend this 



solver to curved geometries is shown in Warburton, 2008 



Local Time Integration Lastly, as the solvers described in this chapter 
all employ explicit marching in time, time step restrictions may become 
an issue if the mesh involves very small elements. 



Klockner 2010 



Chapter 8] describes a number of time stepping schemes that can help 
overcome this problem. 

As we have seen, a simple solver employing discontinuous Galerkin methods 
on the GPU can be written without much effort. On the other hand, far more 
effort can be spent on performance tuning and adaptation to more general 
problems. A free solver that implements nearly all of the improvements 



described here is available at http://mathema.tician.de/software/hedge 



under the GNU Public License. 



4 Final Evaluation 

In the present chapter, we have shown that, even with limited effort, large 
performance gains are realizable for discontinuous Galerkin methods using 
explicit time integration. Figure |3] shows a live snapshot of the simulation 
of a wave propagation problem on a moderately complex domain as shown 
during run time by the solver if the mayavi3j visualization package is in- 
stalled. 



Figure 4(a) shows the performance of the demonstration solver developed 
here and compares it to the performance of hedge, our (freely available) 
production solver. Note that these graphs should not be compared directly, 
as one shows the result of a 2D simulation, while the other portrays the 
performance of a three-dimensional computation. With some care, a few 
observations can be made however. 



http : //code . enthought . com/pro jects/mayavi/ 
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Figure 3: Screen shot of the demonstration solver showing a hve snapshot of the simulation 
of a wave propagation problem on a moderately complex domain. 
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(a) GFlops/s plotted vs. polynomial degree for 
the 2D demonstration solver described in this 
chapter. 
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(b) GFlops/s plotted vs. polynomial degree for 
the 3D DG solver hedge. 



Figure 4: Performance figures for the demonstration solver (2D) and the full-featured (3D) 
solver 'hedge', both executing solvers for the Maxwell problem on an Nvidia GTX 295 in 
single precision. 
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It is possible to observe that, as the element-local matrices grow with the 
third power of the degree in three dimensions (as opposed to the second 
power in 2D), they constitute a larger part of the computation and con- 
tribute many more floating point operations, leading to higher performance. 
For the same reason, performance increases as the polynomial degree N 
increases. 



Many of the tuning ideas described in Section 3.3 (such as microblocking 
and matrix-in-local) are designed to help performance in the case of lower 
N, and hence, smaller matrices. Without comparing absolute numbers, we 
observe that the initial performance increase at low N is much faster in the 
production solver than in the demonstration solver. We attribute this to 
the implementation of these extra strategies, that lead to markedly better 
performance at moderate N. In [Klockner et al. 2009a| , we also briefly 



study the individual and combined effects of a few of these performance 
optimizations. 

As a final observation, we would like to remark that the performance ob- 
tained here is very close to to the performance obtained in a C-based OpenCL 
solver written for the same problem-the use of Python as an implementation 
language does not hamper the speed of the solver at all. In particular, this fa- 
cilitates a very logical splitting of computational software into performance- 
critical, low-level parts written in OpenCL C, and performance-uncritical 
set-up and administrative parts written in Python. 



5 Future Directions 

We have shown that, using our strategies, high-order DG methods can reach 
double-digit percentages of published theoretical peak performance values 
for the hardware under consideration. This computational speed translates 
directly into an increase of the size of the problem that can be reasonably 
treated using these methods. A single compute device can now do work 
that previously required a roomful of computing hardware-even using the 
simplistic implementation demonstrated here. 

It is our stated goal to further broaden the usefulness of the method through 
continued investigation of the treatment of nonlinear problems, improved 
time integration characteristics, and coupling to other discretizations to op- 
timally exploit the characteristics of each. GPUs present a rare opportunity. 
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and it is fortuitous that a method hke DG, which is known for highly accu- 
rate solutions, can benefit so tremendously from this computational advance. 
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